training parameter
Visualizing the Loss Landscape of Neural Nets
Neural network training relies on our ability to find good minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple filter normalization method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
Visualizing the Loss Landscape of Neural Nets
Neural network training relies on our ability to find good minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple filter normalization method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition
Huang, Ying, Man, Yuanbin, Jia, Wenqi, Tu, Zhengzhong, Huang, Junzhou, Yin, Miao
Adapter-based fine-tuning has gained remarkable attention in adapting large pre-trained vision language models (VLMs) for a wide range of downstream tasks efficiently. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase the capacity of adapters. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. T o remove the high redundancy that exists among adapters across layers, we exploit the tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves the state-of-the-art performance while reducing average training parameters by 90%.
UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking
Wang, He, Xu, Tianyang, Tang, Zhangyong, Wu, Xiao-Jun, Kittler, Josef
--Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (X represents depth, event, or thermal modality) either rely on the task-specific training strategy for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination across various multi-modal tracking tasks. T o achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a T ask-Customized Optimization Adapter (TCOA) tailored to various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons conducted on five benchmarks including LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack, covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate our innovative approach achieves comparative performance by introducing only additional training parameters of 1.87M and flops of 1.95G. The code will be available at https://github.com/wanghe/UASTrack. Index T erms --Multi-modal object tracking, Unified multi-modal tracking tasks, Adaptive task recognition. Isual object tracking [1]-[4] is a crucial research area in computer vision, focusing on estimating the position and size of an object throughout a video sequence, beginning with the object initial state in the first frame. Recent advancements highlight the limitations of relying solely on visible sensors, leading to increased interest in utilizing auxiliary modalities such as thermal (T) [5], event (E) [6], and depth (D) [7]. He Wang, Tianyang Xu, Zhangyong Tang, Shaochuan Zhao, and Xiao-Jun Wu (Corresponding author) are with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China (email: 7243115005@stu.jiangnan.edu.cn;
Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure
Wang, Zhicong, Ma, Zicheng, Cao, Ziqiang, Zhou, Changlong, Zhang, Jun, Gao, Yiqin
Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein Q&A systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and employ LoRA techniques for the decoder. We conducted experiments on two datasets, both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.
Pilot: Building the Federated Multimodal Instruction Tuning Framework
Xiong, Baochen, Yang, Xiaoshan, Song, Yaguang, Wang, Yaowei, Xu, Changsheng
In this paper, we explore a novel federated multimodal instruction tuning task(FedMIT), which is significant for collaboratively fine-tuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework(Pilot). Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters(CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training parameters, which optimizes parameter aggregation by calculating weights based on the euclidean distance between parameters, so that parameter aggregation can benefit from positive effects to the greatest extent while effectively reducing negative effects. Our framework can collaboratively exploit distributed data from different local clients to learn cross-task knowledge without being affected by the task heterogeneity during instruction tuning. The effectiveness of our method is verified in two different cross-task scenarios.
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Xu, Lang, Anthony, Quentin, Hatef, Jacob, Shafi, Aamir, Subramoni, Hari, K., Dhabaleswar, Panda, null
Scaling up Large Language Model(LLM) training involves fitting a tremendous amount of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communications, given that intra-node GPU-GPU transfer generally has more bandwidth and lower latency than inter-node connections. However, as more capable infrastructure like Frontier, equipped with AMD GPUs, emerged with impressive computing capability, there is a need for investigations on the hardware topology and to develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization. In this paper, we propose a 3-level hierarchical partitioning specifically for the current Top-1 supercomputing cluster, Frontier, which aims at leveraging various bandwidths across layers of communications (GCD-GCD, GPU-GPU, and inter-node) to reduce communication overhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU when compared with ZeRO++ up to 384 GCDs and a scaling efficiency of 0.94 for up to 384 GCDs. To the best of our knowledge, our work is also the first effort to efficiently optimize LLM workloads on Frontier AMD GPUs.
SurvBETA: Ensemble-Based Survival Models Using Beran Estimators and Several Attention Mechanisms
Utkin, Lev V., Khomets, Semen P., Efremenko, Vlada A., Konstantinov, Andrei V.
Many ensemble-based models have been proposed to solve machine learning problems in the survival analysis framework, including random survival forests, the gradient boosting machine with weak survival models, ensembles of the Cox models. To extend the set of models, a new ensemble-based model called SurvBETA (the Survival Beran estimator Ensemble using Three Attention mechanisms) is proposed where the Beran estimator is used as a weak learner in the ensemble. The Beran estimator can be regarded as a kernel regression model taking into account the relationship between instances. Outputs of weak learners in the form of conditional survival functions are aggregated with attention weights taking into account the distance between the analyzed instance and prototypes of all bootstrap samples. The attention mechanism is used three times: for implementation of the Beran estimators, for determining specific prototypes of bootstrap samples and for aggregating the weak model predictions. The proposed model is presented in two forms: in a general form requiring to solve a complex optimization problem for its training; in a simplified form by considering a special representation of the attention weights by means of the imprecise Huber's contamination model which leads to solving a simple optimization problem. Numerical experiments illustrate properties of the model on synthetic data and compare the model with other survival models on real data. A code implementing the proposed model is publicly available.
Univariate Conditional Variational Autoencoder for Morphogenic Patterns Design in Frontal Polymerization-Based Manufacturing
Liu, Qibang, Cai, Pengfei, Abueidda, Diab, Vyas, Sagar, Koric, Seid, Gomez-Bombarelli, Rafael, Geubelle, Philippe
Under some initial and boundary conditions, the rapid reaction-thermal diffusion process taking place during frontal polymerization (FP) destabilizes the planar mode of front propagation, leading to spatially varying, complex hierarchical patterns in thermoset polymeric materials. Although modern reaction-diffusion models can predict the patterns resulting from unstable FP, the inverse design of patterns, which aims to retrieve process conditions that produce a desired pattern, remains an open challenge due to the non-unique and non-intuitive mapping between process conditions and manufactured patterns. In this work, we propose a probabilistic generative model named univariate conditional variational autoencoder (UcVAE) for the inverse design of hierarchical patterns in FP-based manufacturing. Unlike the cVAE, which encodes both the design space and the design target, the UcVAE encodes only the design space. In the encoder of the UcVAE, the number of training parameters is significantly reduced compared to the cVAE, resulting in a shorter training time while maintaining comparable performance. Given desired pattern images, the trained UcVAE can generate multiple process condition solutions that produce high-fidelity hierarchical patterns.
SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers
Chekalina, Viktoriia, Rudenko, Anna, Mezentsev, Gleb, Mikhalev, Alexander, Panchenko, Alexander, Oseledets, Ivan
The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1\% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, robust popular state-of-the-art PEFT approaches.